On the Interpretation of Bootstrap Trees: Appropriate Threshold of Clade Selection and Induced Gain
Authors
Abstract
In this study we address the problem of interpreting a bootstrap tree. The main issue is choosing the threshold of clade selection that separates reliable clades from unreliable ones on the basis of their bootstrap proportion. This threshold depends on the chosen error measure. We investigate error measures that stem from a generalization of Robinson and Foulds' (1981) distance, used to quantify the divergence between the true phylogeny and the estimated trees. We propose two analytical approximations of the optimum threshold of clade selection to interpret (i.e., reduce) the bootstrap tree. We performed extensive simulations along the lines of Kuhner and Felsenstein (1994) using the neighbor-joining and the maximum-parsimony methods. These simulations show that our approximations cause only small losses in quality when compared to the optimum threshold obtained from empirical observation. Next, we measured the error reduction achieved when the true phylogeny is estimated by the properly reduced bootstrap tree rather than by the complete original tree obtained with a classical tree-building method. Our simulations on short sequences show that, when the error is measured with the standard Robinson and Foulds distance, an error reduction of 39% is achieved with the parsimony method and 33% with the distance method. The observed error reduction is shown to originate from a large decrease in Type I error (wrong inferences), while Type II error (omitted correct clades) is only slightly increased. Greater error reduction is achieved when shorter sequences are used, and when more importance is given to Type I error than to Type II error. To investigate the causes of error from another point of view, we propose a general decomposition of the error expectation into two bias terms and one variance term. Results for these terms show that no fundamental bias is introduced by the bootstrap process, the only source of bias being structural (lack of resolution). Moreover, the variance of the estimates is greatly reduced, providing another explanation for the better results of the reduced bootstrap tree compared with the original tree estimate.
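The idea of reducing a bootstrap tree at a threshold and scoring the result against the true phylogeny can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the study: clades are modelled as sets of taxon names, the generalized error is written as a weighted sum of Type I and Type II counts (equal weights give a Robinson and Foulds style count of differing clades), and the names `reduce_tree`, `weighted_error`, and `best_threshold` as well as the toy support values are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set

Clade = FrozenSet[str]  # a clade is modelled as the set of its taxa (assumption)


def reduce_tree(support: Dict[Clade, float], threshold: float) -> Set[Clade]:
    """Keep only the clades whose bootstrap proportion reaches the threshold."""
    return {clade for clade, p in support.items() if p >= threshold}


def weighted_error(estimated: Set[Clade], true: Set[Clade],
                   type1_weight: float = 1.0, type2_weight: float = 1.0) -> float:
    """Weighted count of wrong clades (Type I) and omitted true clades (Type II).

    The weighted form is an illustrative assumption; with both weights equal
    to 1 it counts the clades on which the two trees differ.
    """
    type1 = len(estimated - true)   # inferred clades absent from the true tree
    type2 = len(true - estimated)   # true clades missing from the estimate
    return type1_weight * type1 + type2_weight * type2


def best_threshold(support: Dict[Clade, float], true: Set[Clade],
                   candidates: Iterable[float],
                   type1_weight: float = 1.0, type2_weight: float = 1.0) -> float:
    """Empirical scan: the candidate threshold with the smallest error.

    Only usable in simulations, where the true tree is known.
    """
    return min(candidates,
               key=lambda t: weighted_error(reduce_tree(support, t), true,
                                            type1_weight, type2_weight))


# Toy data (hypothetical): a 5-taxon true tree and bootstrap proportions
# for the clades of an estimated tree.
true_clades = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
support = {
    frozenset("AB"): 0.95,   # correct, strongly supported
    frozenset("ABD"): 0.35,  # wrong, weakly supported
    frozenset("CE"): 0.30,   # wrong, weakly supported
}

full_error = weighted_error(reduce_tree(support, 0.0), true_clades)
reduced_error = weighted_error(reduce_tree(support, 0.5), true_clades)
chosen = best_threshold(support, true_clades, [0.0, 0.5, 0.9, 1.0])
```

In this toy case the complete tree has two wrong and two omitted clades (error 4), while the tree reduced at 0.5 keeps only the correct clade (error 2), so the reduction comes entirely from Type I error. Raising `type1_weight` above `type2_weight` penalizes wrong inferences more heavily, which is the setting in which the abstract reports the largest gains.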
Similar resources
Phylogenetic relationships of the commercial marine shrimp family Penaeidae from Persian Gulf
Phylogenetic relationships among all described species (total of 5 taxa) of the shrimp genus Penaeus, were examined with nucleotide sequence data from portions of mitochondrial gene and cytochrome oxidase subunit I (COI). There are twelve commercial shrimp in the Iranian coastal waters. The reconstruction of the evolution phylogeny of these species is crucial in revealing stock identity that ca...
Comparing different stopping criteria for fuzzy decision tree induction through IDFID3
Fuzzy Decision Tree (FDT) classifiers combine decision trees with approximate reasoning offered by fuzzy representation to deal with language and measurement uncertainties. When a FDT induction algorithm utilizes stopping criteria for early stopping of the tree's growth, threshold values of stopping criteria will control the number of nodes. Finding a proper threshold value for a stopping crite...
Outlier Detection by Boosting Regression Trees
A procedure for detecting outliers in regression problems is proposed. It is based on information provided by boosting regression trees. The key idea is to select the most frequently resampled observation along the boosting iterations and reiterate after removing it. The selection criterion is based on Tchebychev’s inequality applied to the maximum over the boosting iterations of ...
Genetic Variation and Inheritance of Early Growth Characteristics in Three Wild Pistachio Populations
Pistacia atlantica is the most important tree species for the economy of many rural areas in west of Iran, but no effort has been made for the genetic improvement of this species. The aim of this investigation was to study the genetic variation and inheritance of early growth traits in P. atlantica. For this purpose, three wild pistachio populations comprising 60 randomly selected adult trees f...